Uber Eats USA Restaurants Analysis¶

Rhea Jajodia, Leah Kannan, Jasneet Kaur¶

Introduction¶

As college students who began their undergraduate studies in the midst of a pandemic, we witnessed the rise of many new businesses and services. During the pandemic, people were encouraged to stay indoors and avoid public gatherings, so naturally, services which allowed consumers to obtain what they needed without leaving the comfort of their homes rose in popularity. Uber Eats is a service which falls into this category. It allows people to order food from their home by simply using their smartphone to choose a restaurant, enter their address, and pay the fee for the food and delivery. A Uber Eats driver will drive to the store, pick up the food, and deliver it to the front door of the customer. The consumer can enjoy a full meal without having to cook or leave their home, which people considered the safer option during the Covid 19 outbreak.

Today, as society works towards a post-pandemic world, we still see the resilience of services like UberEats. People still prefer the convenience of not having to leave their homes for food. Now, the motivation could be anything from not having transportation options to simply not wanting to make a meal after a long day. The service of having food delivered to your doorstep will most likely be in demand for a while.

For our final CMSC320 project, we decided to explore and analyze Uber Eats data to determine many different relationships different components of restaurants have with each other and the consumers that choose to give them their business.

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
import pandas as pd
import re
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import folium
from folium import plugins
from folium.plugins import HeatMap
from sklearn import linear_model
from sklearn import datasets
from sklearn.linear_model import LinearRegression 
from sklearn.model_selection import train_test_split  
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from scipy import stats
from sklearn.model_selection import train_test_split, KFold, cross_val_score

We started off using the raw data provided by “Uber Eats Exploration” on the platform Kaggle. This dataset provides us with information about various restaurants that use the platform.

In [ ]:
df = pd.read_csv('/content/drive/MyDrive/restaurants.csv')
df
Out[ ]:
id position name score ratings category price_range full_address zip_code lat lng
0 1 19 PJ Fresh (224 Daniel Payne Drive) NaN NaN Burgers, American, Sandwiches $ 224 Daniel Payne Drive, Birmingham, AL, 35207 35207 33.562365 -86.830703
1 2 9 J' ti`'z Smoothie-N-Coffee Bar NaN NaN Coffee and Tea, Breakfast and Brunch, Bubble Tea NaN 1521 Pinson Valley Parkway, Birmingham, AL, 35217 35217 33.583640 -86.773330
2 3 6 Philly Fresh Cheesesteaks (541-B Graymont Ave) NaN NaN American, Cheesesteak, Sandwiches, Alcohol $ 541-B Graymont Ave, Birmingham, AL, 35204 35204 33.509800 -86.854640
3 4 17 Papa Murphy's (1580 Montgomery Highway) NaN NaN Pizza $ 1580 Montgomery Highway, Hoover, AL, 35226 35226 33.404439 -86.806614
4 5 162 Nelson Brothers Cafe (17th St N) 4.7 22.0 Breakfast and Brunch, Burgers, Sandwiches NaN 314 17th St N, Birmingham, AL, 35203 35203 33.514730 -86.811700
... ... ... ... ... ... ... ... ... ... ... ...
40222 40223 54 Mangia la pasta! (5610 N Interstate Hwy 35) 4.8 500.0 Pasta, Comfort Food, Italian, Group Friendly $ 5610 N I35, Austin, TX, 78751 78751 30.316248 -97.708441
40223 40224 53 Wholly Cow Burgers (S Lamar) 4.6 245.0 American, Burgers, Breakfast and Brunch, Aller... $ 3010 S Lamar Blvd, Austin, TX, 78704 78704 30.242816 -97.783821
40224 40225 52 EurAsia Ramen 3 4.7 293.0 Sushi, Asian, Japanese, Exclusive to Eats, Gro... $ 5222 Burnet Road, Austin, TX, 78756 78756 30.324290 -97.740200
40225 40226 51 Austin's Habibi (5th St) 4.7 208.0 Mediterranean, Gluten Free Friendly, Allergy F... $$ 817 W 5th St, Austin, TX, 78703 78703 30.269580 -97.753110
40226 40227 50 Beijing Wok 4.4 254.0 Chinese, Asian, Asian Fusion, Family Friendly,... $ 8106 Brodie Ln, Austin, TX, 78749 78749 30.202210 -97.838689

40227 rows × 11 columns

For each entry, the dataset stores a unique id, the restaurant’s position in the search list, its name, rating, number of ratings, category tags, price range, full address, zip code, and the latitude and longitude. However, some of these columns – mainly latitude and longitude – store missing values as 0 instead of null, which means that even though it seems like there’s a latitude and longitude given for every entry, there are actually some rows with missing values for those fields.

In [ ]:
df.count()
Out[ ]:
id              40227
position        40227
name            40227
score           22254
ratings         22254
category        40204
price_range     33581
full_address    39949
zip_code        39940
lat             40227
lng             40227
dtype: int64

We then decided to rename the columns “score” and “ratings” to “rating” and “number_of_ratings” to improve readability and to avoid potential confusion between the two columns.

In [ ]:
# rename the columns so that the column names are more clear
df.rename(columns={'score': 'rating', 'ratings': 'number_of_ratings'}, inplace=True)
df.head()
Out[ ]:
id position name rating number_of_ratings category price_range full_address zip_code lat lng
0 1 19 PJ Fresh (224 Daniel Payne Drive) NaN NaN Burgers, American, Sandwiches $ 224 Daniel Payne Drive, Birmingham, AL, 35207 35207 33.562365 -86.830703
1 2 9 J' ti`'z Smoothie-N-Coffee Bar NaN NaN Coffee and Tea, Breakfast and Brunch, Bubble Tea NaN 1521 Pinson Valley Parkway, Birmingham, AL, 35217 35217 33.583640 -86.773330
2 3 6 Philly Fresh Cheesesteaks (541-B Graymont Ave) NaN NaN American, Cheesesteak, Sandwiches, Alcohol $ 541-B Graymont Ave, Birmingham, AL, 35204 35204 33.509800 -86.854640
3 4 17 Papa Murphy's (1580 Montgomery Highway) NaN NaN Pizza $ 1580 Montgomery Highway, Hoover, AL, 35226 35226 33.404439 -86.806614
4 5 162 Nelson Brothers Cafe (17th St N) 4.7 22.0 Breakfast and Brunch, Burgers, Sandwiches NaN 314 17th St N, Birmingham, AL, 35203 35203 33.514730 -86.811700

To analyze the data using location as a variable, we decided to create a new column called “State” that would contain each entry’s state location stored as its two-letter abbreviation. In order to do this, we first used the “full_adress” of each state and used a regular expression to pull the two-letter abbreviation if it was in the address. If it was not in the address, the State value would be set to None.

In [ ]:
# making new column with state data
lst = []
for a in df["full_address"]:
    result = re.search('^(([^,]+,)+) ([A-Z]{2}),', str(a))
    result2 = re.search('^(([^,]+,)+)(,|\s,) ([A-Z]{2})(,|)', str(a))
    if result:
      lst.append(result.group(3))
    elif result2:
      lst.append(result2.group(4))
    else:
      lst.append(None)
df["State"] = lst

# list unique states in the dataset
df["State"].unique()
Out[ ]:
array(['AL', None, 'WY', 'WI', 'MN', 'IL', 'WV', 'OH', 'WA', 'OR', 'ID',
       'VA', 'DC', 'MD', 'TN', 'VT', 'UT', 'PR', 'TX'], dtype=object)

We then tried to fill in these missing entries by using the zip codes given in the dataset to find the location’s state. To do this, we downloaded a zip code dataset containing location information for zip codes in the United States.

In [ ]:
# load in zip code data set
zipcodes_df = pd.read_csv('/content/drive/MyDrive/free-zipcode-database-Primary.csv')
zipcodes_df.head()
Out[ ]:
Zipcode ZipCodeType City State LocationType Lat Long Location Decommisioned TaxReturnsFiled EstimatedPopulation TotalWages
0 705 STANDARD AIBONITO PR PRIMARY 18.14 -66.26 NA-US-PR-AIBONITO False NaN NaN NaN
1 610 STANDARD ANASCO PR PRIMARY 18.28 -67.14 NA-US-PR-ANASCO False NaN NaN NaN
2 611 PO BOX ANGELES PR PRIMARY 18.28 -66.79 NA-US-PR-ANGELES False NaN NaN NaN
3 612 STANDARD ARECIBO PR PRIMARY 18.45 -66.73 NA-US-PR-ARECIBO False NaN NaN NaN
4 601 STANDARD ADJUNTAS PR PRIMARY 18.16 -66.72 NA-US-PR-ADJUNTAS False NaN NaN NaN

We then parsed through it to find state data for any entry with a zip code, storing this in a column called “state_by_zip”.

In [ ]:
# initialize new row
row = []

# iterate through the zip_code column in uber eats dataframe
for zips in df['zip_code']:
  # make sure zip code is only 5 digits
  zip = str(zips)[0:5]

  # check for non-numeric values
  if zip.isnumeric():
    # Gets the row where the zip codes match and gets State data
    l = (zipcodes_df.loc[zipcodes_df['Zipcode'] == int(zip)])["State"]

    # if l has information and is not empty
    if len((pd.Series(list(l)).values)) > 0:
      # add state data to row
      row.append((pd.Series(list(l)).values)[0])

    # else add nan to row
    else:
      row.append(np.nan)
  
  # handles non-numeric values
  else:
    row.append(np.nan)
    
# check that len of row is same as the dataframe (len = 40227)
len(row)
<ipython-input-8-ea2c5bf95b98>:15: DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
  if len((pd.Series(list(l)).values)) > 0:
Out[ ]:
40227

We then viewed the data by state, and notice there were more unique states in the zip code data than the data pulled from the regular expression.

In [ ]:
# Add in updated state data
df["state_by_zip"] = row

# View count data by State
df.groupby("state_by_zip").count()
Out[ ]:
id position name rating number_of_ratings category price_range full_address zip_code lat lng State
state_by_zip
AL 1102 1102 1102 524 524 1101 944 1102 1102 1102 1102 1102
AZ 1 1 1 1 1 1 1 1 1 1 1 1
CA 5 5 5 2 2 5 4 5 5 5 5 5
CT 5 5 5 3 3 5 5 5 5 5 5 5
DC 1508 1508 1508 907 907 1508 1099 1508 1508 1508 1508 1508
FL 2 2 2 1 1 2 2 2 2 2 2 2
GA 1 1 1 1 1 1 1 1 1 1 1 1
ID 25 25 25 8 8 25 21 25 25 25 25 25
IL 208 208 208 127 127 208 188 208 208 208 208 208
IN 2 2 2 1 1 2 2 2 2 2 2 2
MA 1 1 1 0 0 1 1 1 1 1 1 1
MD 897 897 897 619 619 897 713 897 897 897 897 897
MN 44 44 44 28 28 44 41 44 44 44 44 44
MO 1 1 1 0 0 1 1 1 1 1 1 1
MT 1 1 1 0 0 1 1 1 1 1 1 1
NE 1 1 1 1 1 1 1 1 1 1 1 1
NH 5 5 5 0 0 5 5 5 5 5 5 5
NJ 2 2 2 1 1 2 2 2 2 2 2 2
NY 5 5 5 1 1 5 5 5 5 5 5 5
OH 17 17 17 4 4 17 17 17 17 17 17 17
OR 1023 1023 1023 522 522 1022 707 1023 1023 1023 1023 1023
PA 3 3 3 0 0 3 3 3 3 3 3 3
PR 201 201 201 142 142 201 148 201 201 201 201 32
SC 2 2 2 0 0 2 2 2 2 2 2 2
TN 42 42 42 6 6 42 39 42 42 42 42 42
TX 7259 7259 7259 4317 4317 7258 5813 7259 7259 7259 7259 7258
UT 3071 3071 3071 1597 1597 3071 2545 3071 3071 3071 3071 3071
VA 9248 9248 9248 5581 5581 9238 7966 9248 9248 9248 9248 9248
VT 347 347 347 79 79 347 329 347 347 347 347 347
WA 8883 8883 8883 5574 5574 8883 7264 8883 8883 8883 8883 8883
WI 4307 4307 4307 1681 1681 4307 3886 4307 4307 4307 4307 4307
WV 1373 1373 1373 322 322 1372 1322 1373 1373 1373 1373 1373
WY 320 320 320 75 75 320 307 320 320 320 320 320

This observation led to the discovery that some of the zip codes in the dataset were not correct and did not correspond to the address that was listed. Fortunately, the state abbreviations listed in “full_address” were correct, so we used “state_in_zip” to fill some missing entries that the regular expression did not fill, and made sure there weren’t many wrong entries.

In [ ]:
# make new col
new_state = [];

# add in the states found by zip code into the dataset (but leave in original data)
for i in range (0, len(df['State'])):
  if df['State'][i]==0:
    new_state.append(df['state_by_zip'][i])
  else:
    new_state.append(df['State'][i])
df['State'] = new_state

We then viewed the spread of the data by state.

In [ ]:
# see the data spread per state
df.groupby("State").count()
Out[ ]:
id position name rating number_of_ratings category price_range full_address zip_code lat lng state_by_zip
State
AL 1107 1107 1107 526 526 1106 949 1107 1107 1107 1107 1107
DC 1511 1511 1511 907 907 1511 1100 1511 1511 1511 1511 1511
ID 27 27 27 8 8 27 23 27 27 27 27 27
IL 204 204 204 126 126 204 184 204 204 204 204 204
MD 895 895 895 619 619 895 712 895 895 895 895 895
MN 43 43 43 27 27 43 39 43 43 43 43 43
OH 15 15 15 3 3 15 15 15 15 15 15 15
OR 1024 1024 1024 523 523 1023 708 1024 1024 1024 1024 1023
PR 32 32 32 31 31 32 30 32 32 32 32 32
TN 42 42 42 6 6 42 39 42 42 42 42 42
TX 7272 7272 7272 4324 4324 7266 5820 7272 7267 7272 7272 7263
UT 3085 3085 3085 1603 1603 3085 2557 3085 3085 3085 3085 3075
VA 9264 9264 9264 5586 5586 9254 7983 9264 9264 9264 9264 9262
VT 347 347 347 79 79 347 329 347 347 347 347 347
WA 8895 8895 8895 5580 5580 8891 7270 8895 8891 8895 8895 8887
WI 4317 4317 4317 1686 1686 4317 3897 4317 4317 4317 4317 4311
WV 1379 1379 1379 322 322 1378 1328 1379 1379 1379 1379 1379
WY 319 319 319 75 75 319 306 319 319 319 319 319

Next, we will clean up the category data field.

In [ ]:
df['category'].nunique()
Out[ ]:
10647

When we count all the unique values of categories we are given 10,647 different categories. When looking back at the category list, we can see this is because the cuisines are listed as strings and can have many different pairings of key words in many different orders. We want to clean this data in order for it to be usable in more analysis.

First, lets take a look at the most common category names.

In [ ]:
category_counts = df['category'].value_counts().to_frame()

category_counts.head(15)
Out[ ]:
category
Burgers, American, Sandwiches 1619
Mexican, Latin American, New Mexican 1168
Fast Food, Sandwich, American 838
Pizza, American, Italian 714
American, Burgers, Fast Food 686
American, Burgers, Sandwiches 484
Burritos, Fast Food, Mexican 430
Coffee and Tea, American, Breakfast and Brunch 410
Chinese, Asian, Asian Fusion 366
American, burger, Fast Food 360
Pharmacy, Convenience, Everyday Essentials, Baby 326
American, burger, Fast Food, Family Meals 313
Breakfast and Brunch, American, Sandwiches 286
Bakery, Breakfast and Brunch, Cafe, Coffee &amp; Tea 276
Sandwiches, American, Healthy 270

When looking at the top 15 categories, we can already see lots of overlap in the types of food, just under different names and labels. For example, 'burgers' are part of 5 separate categories listed since the strings can have lots of variation.

Due to this, we will do some cleaning to make more general categories that will have restaurants of the same cuisines in them.

In order to find out which general categories we should sort them into, we found some reports from Uber on the most popular cuisines. (Found at this link https://www.uber.com/newsroom/the-2021-uber-eats-cravings-report/)

From that report the listings of most popular cuisines are as follows:

The most popular cuisines:

  1. Mexican
  2. Burgers + Sandwiches
  3. Chinese
  4. Indian
  5. Pizza
  6. Sushi
  7. Thai
  8. Mediterranean
  9. Breakfast (Bagels + Donuts)
  10. Vietnamese

Since theses are the most popular cuisines on the app, we figured that the UberEats app would have a relatively large number of these types of restaurants.

In [ ]:
# changing category type to string so that we can do manipulations later
df['category'] = df['category'].astype(str)
In [ ]:
# sort categories
df['gen_category'] = df['category'].apply(lambda x: 'American' if "burger" in x.lower() else
                                          'Mexican' if "mexican" in x.lower() else 
                                          'Mexican' if "taco" in x.lower() else
                                          'Mexican' if "burrito" in x.lower() else
                                          'Chinese' if "chinese" in x.lower() else 
                                          'Indian' if "indian" in x.lower() else 
                                          'Pizza' if "pizza" in x.lower() else
                                          'Japanese' if "sushi" in x.lower() else
                                          'Japanese' if "japanese" in x.lower() else
                                          'Thai' if "thai" in x.lower() else
                                          'Mediterranean' if "mediterranean" in x.lower() else
                                          'Breakfast' if "breakfast" in x.lower() else
                                          'Breakfast' if "bagel" in x.lower() else
                                          'Breakfast' if "donut" in x.lower() else
                                          'Vietnamese' if "vietnamese" in x.lower() else
                                          'American' if "american" in x.lower() else 
                                          'American' if "sandwich" in x.lower() else
                                          'Convenience' if "convenience" in x.lower() else
                                          'Korean' if 'korean' in x.lower() else
                                          'Asian' if 'asian' in x.lower() else
                                          'Italian' if 'italian' in x.lower() else
                                          'Dessert' if 'dessert' in x.lower() else
                                          'Dessert' if 'ice cream' in x.lower() else
                                          'Vegetarian' if 'vegetarian' in x.lower() else
                                          x)
# np.nan
# ("burger"||"american"||"sandwich")

To categorize the entries into more generic and overlapping categories, we looked for key words in the category string entries and then assigned them a general category name in the row gen_category. For example, if we found "burger" or "american" in the category, we assigned them to the "American" category.

We initally only made categories for the 10 listed above, but when displaying the categories created and their counts (as shown below in the table) there were still overlapping category key strings that could be combined. Due to this, we decided to make more categories to hold the types of stores. Some example of the ones added were Italian and Convinience.

In [ ]:
better_cat = df['gen_category'].value_counts().to_frame()
better_cat.head(20)
Out[ ]:
gen_category
American 15531
Mexican 4058
Pizza 3936
Breakfast 3222
Convenience 1750
Chinese 1675
Dessert 1558
Japanese 1390
Indian 1011
Mediterranean 778
Thai 717
Korean 578
Asian 515
Italian 514
Vietnamese 417
Vegetarian 333
Juice and Smoothies, Healthy, Fast Food 78
Retail, Gift Store, Beauty Supply 75
Alcohol, Liquor Stores, Wine 64
Halal, Chicken, Middle Eastern 36
In [ ]:
df['gen_category'].nunique()
Out[ ]:
986

For the entries that have not been added to the general categories, we will put NaN values in to make it cleaner when doing data processing by category. We are left with 16 categories as displayed below.

In [ ]:
# sort categories
df['gen_category'] = df['category'].apply(lambda x: 'American' if "burger" in x.lower() else
                                          'Mexican' if "mexican" in x.lower() else 
                                          'Mexican' if "taco" in x.lower() else
                                          'Mexican' if "burrito" in x.lower() else
                                          'Chinese' if "chinese" in x.lower() else 
                                          'Indian' if "indian" in x.lower() else 
                                          'Pizza' if "pizza" in x.lower() else
                                          'Japanese' if "sushi" in x.lower() else
                                          'Japanese' if "japanese" in x.lower() else
                                          'Thai' if "thai" in x.lower() else
                                          'Mediterranean' if "mediterranean" in x.lower() else
                                          'Breakfast' if "breakfast" in x.lower() else
                                          'Breakfast' if "bagel" in x.lower() else
                                          'Breakfast' if "donut" in x.lower() else
                                          'Vietnamese' if "vietnamese" in x.lower() else
                                          'American' if "american" in x.lower() else 
                                          'American' if "sandwich" in x.lower() else
                                          'Korean' if 'korean' in x.lower() else
                                          'Asian' if 'asian' in x.lower() else
                                          'Italian' if 'italian' in x.lower() else
                                          'Dessert' if 'dessert' in x.lower() else
                                          'Dessert' if 'ice cream' in x.lower() else
                                          'Vegetarian' if 'vegetarian' in x.lower() else
                                          'Convenience' if "convenience" in x.lower() else
                                          np.nan)
In [ ]:
better_cat = df['gen_category'].value_counts().to_frame()
better_cat.head(16)
Out[ ]:
gen_category
American 15531
Mexican 4058
Pizza 3936
Breakfast 3222
Convenience 1720
Chinese 1675
Dessert 1581
Japanese 1390
Indian 1011
Mediterranean 778
Thai 717
Korean 580
Asian 517
Italian 514
Vietnamese 417
Vegetarian 336

Analysis of Ratings

Next we want to take a look at the ratings of the restaurants and see what patterns can be uncovered. We first plotted the rating score versus the number of ratings. From this scatter plot, we can see that the data is skewed left so most restaurants that have a lot of ratings tend to be on the higher end of the rating score. This was interesting data to be seen by itself but we then decided to see if there was any difference in patterns for ratings and number of ratings by each price point.

In [ ]:
# set size of plot
plt.figure(figsize=(12, 8))

font1 = {'size':20}
font2 = {'size':15}

#set the title
plt.title("Rating Score versus Number of Ratings", fontdict = font1)

# naming the x and y axis
plt.xlabel('Rating Score', fontdict = font2)
plt.ylabel('Number of Ratings', fontdict = font2)

# plot the scatter of rating vs number of ratings
plt.scatter(df.rating, df.number_of_ratings)
Out[ ]:
<matplotlib.collections.PathCollection at 0x7f549453a850>

Before we split the price points up, we first wanted to see how many restaurants each price point contained. The data is sectioned off into 4 price points: $, $, $$$, and $$$$. To make this a bit simpler to follow and talk about we will be referring to the price points as $: low, $$: medium, $$$: high, and $$$$: extremely high. From this data we see that over half of the restaurants in the dataset are in the low price point, the medium price point has less than half of those and the high and extremely high price points do not have many restaurants on UberEats. This pattern is not too surprising since a lot of people use these apps to get fast food delivered to them and those would most likely be at the lowest price point. Also, if a restaurant has a higher price point, it is more likely to be a dine in restaurant and more people tend to go in person and not use the food delivery apps.

Before we split the price points up, we first wanted to see how many restaurants each price point contained. The data is sectioned off into 4 price points: $, $, $$$, and $$$$. To make this a bit simpler to follow and talk about we will be referring to the price points as $: low, $$: medium, $$$: high, and $$$$: extremely high. From this data we see that over half of the restaurants in the dataset are in the low price point, the medium price point has less than half of those and the high and extremely high price points do not have many restaurants on UberEats. This pattern is not too surprising since a lot of people use these apps to get fast food delivered to them and those would most likely be at the lowest price point. Also, if a restaurant has a higher price point, it is more likely to be a dine in restaurant and more people tend to go in person and not use the food delivery apps.

In [ ]:
# Look at price range
df['price_range'].unique()
df['price_range'].value_counts()
Out[ ]:
$       24385
$$       9029
$$$       149
$$$$       18
Name: price_range, dtype: int64

Moving on, we section off the datasets by price points and then plot their respective rating scores versus number of ratings.

In [ ]:
prices = df.groupby('price_range')

prices['rating'].count()

low = prices.get_group('$')
med = prices.get_group('$$')
high = prices.get_group('$$$')
top = prices.get_group('$$$$')
In [ ]:
#low plot
# set size of plot
plt.figure(figsize=(8, 6))

font1 = {'size':20}
font2 = {'size':15}

#set the title
plt.title("Rating Score versus Number of Ratings for Low Price Range", fontdict = font1)

# naming the x and y axis
plt.xlabel('Rating Score', fontdict = font2)
plt.ylabel('Number of Ratings', fontdict = font2)

# plot the scatter of rating vs number of ratings
plt.scatter(low.rating, low.number_of_ratings)



### Medium plot
# set size of plot
plt.figure(figsize=(8, 6))

font1 = {'size':20}
font2 = {'size':15}

#set the title
plt.title("Rating Score versus Number of Ratings for Medium Price Range", fontdict = font1)

# naming the x and y axis
plt.xlabel('Rating Score', fontdict = font2)
plt.ylabel('Number of Ratings', fontdict = font2)

# plot the scatter of rating vs number of ratings
plt.scatter(med.rating, med.number_of_ratings)


### High plot
# set size of plot
plt.figure(figsize=(8, 6))

font1 = {'size':20}
font2 = {'size':15}

#set the title
plt.title("Rating Score versus Number of Ratings for High Price Range", fontdict = font1)

# naming the x and y axis
plt.xlabel('Rating Score', fontdict = font2)
plt.ylabel('Number of Ratings', fontdict = font2)

# plot the scatter of rating vs number of ratings
plt.scatter(high.rating, high.number_of_ratings)


### Top plot
# set size of plot
plt.figure(figsize=(8, 6))

font1 = {'size':20}
font2 = {'size':15}

#set the title
plt.title("Rating Score versus Number of Ratings for Highest Price Range", fontdict = font1)

# naming the x and y axis
plt.xlabel('Rating Score', fontdict = font2)
plt.ylabel('Number of Ratings', fontdict = font2)

# plot the scatter of rating vs number of ratings
plt.scatter(top.rating, top.number_of_ratings)
Out[ ]:
<matplotlib.collections.PathCollection at 0x7f5493fc0d90>

Looking at these plots sectioned by price point, we see the dataset densities get much lower as the price point increases which aligns with the number of restaurants at each price point from before. The low, medium, and high plots have the same general trend of being skewed left so they tend to have higher ratings and a higher number of ratings. The low price point has a higher density of points near the top right of the graph. This means that there are more highly rated restaurants with a greater number of ratings. This could be due to more people ordering from those restaurants and therefore more people writing reviews. There are much fewer restaurants in the high and extremely high plots but the trend is the same for the high plot. The extremely high price range plot does not seem to have any patterns since the plots are scattered far apart from each other.

Below, we added another representation of the plots by price point below.

In [ ]:
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(111)

ax1.scatter(low.rating, low.number_of_ratings, s=10, c='b', marker="s", label='low')
ax1.scatter(med.rating, med.number_of_ratings, s=10, c='r', marker="o", label='med')
ax1.scatter(high.rating, high.number_of_ratings, s=10, c='c', marker="s", label='high')
ax1.scatter(top.rating, top.number_of_ratings, s=10, c='m', marker="o", label='top')
plt.legend(loc='upper left')
plt.show()

Next we decided to take a look at the best restaurants in the dataset. We decided the best restaurants would be the highest rated restaurants with the highest number of ratings as well. We sorted the restaurants by the two columns mentioned above and we show the top 20 restaurants below.

In [ ]:
# try to find the best rated retaurants by rating and number of ratings
highly_rated_restaurant = df.sort_values(['rating','number_of_ratings'], ascending=False)
highly_rated_restaurant.head(20)
Out[ ]:
id position name rating number_of_ratings category price_range full_address zip_code lat lng State state_by_zip gen_category
18401 18402 68 Starbucks (S. Van Dorn and Pickett) 5.0 223.0 Cafe, Coffee &amp; Tea, Breakfast and Brunch, ... $ 5782 Dow Ave, Alexandria, VA, 22304 22304 38.804558 -77.132929 VA VA Breakfast
28607 28608 169 Sundevich 5.0 176.0 Salads, American, Vegetarian, Sandwich NaN 601 New Jersey Ave. NW, Washington, DC, 20001 20001 38.897830 -77.011590 DC DC American
23134 23135 35 Berries &amp; Bowls 5.0 156.0 Juice and Smoothies, Healthy, Vegetarian $ 120 Market St, Gaithersburg, MD, 20878 20878 39.122270 -77.234758 MD MD Vegetarian
22901 22902 15 Starbucks (South Riding Blvd) 5.0 137.0 Cafe, Coffee &amp; Tea, Breakfast and Brunch, ... $ 43114 Peacock Market #140, South Riding, VA, 2... 20152 38.915668 -77.511693 VA VA Breakfast
19926 19927 86 Open Road (ROSSLYN) 5.0 136.0 Burgers, American, Sandwiches $ 1201 Wilson Boulevard, Arlington, VA, 22209 22209 38.895720 -77.071040 VA VA American
35369 35370 10 Cafe Vida (Rogers Ranch) 5.0 114.0 Breakfast and Brunch, Healthy, Latin American,... $ 2711 Treble Creek, San Antonio, TX, 78258 78258 29.604498 -98.537205 TX TX Breakfast
11950 11951 241 Banh Mi Up 5.0 112.0 Vietnamese, Noodles, Healthy $ 8037 N Lombard St, Portland, OR, 97203 97203 45.589600 -122.748510 OR OR Vietnamese
19814 19815 14 South Block (Falls Church) 5.0 111.0 Juice and Smoothies, Healthy, American $ 2121 N Westmoreland St, Arlington, VA, 22213 22213 38.886520 -77.161690 VA VA American
9333 9334 3 Teriyaki Plus 5.0 110.0 Japanese: Other, Asian, Sushi, Family Friendly $ 11512 124th Ave NE, Kirkland, WA, 98033 98033 47.703596 -122.175306 WA WA Japanese
27298 27299 3 Starbucks (9002 W. Broad Street) 5.0 110.0 Bakery, Cafe $ 9002 W. Broad Street, Richmond, VA, 23294 23294 37.635630 -77.547270 VA VA NaN
12197 12198 65 Kolby's Donut House 5.0 108.0 Bakery, Desserts, Sandwich $ 15012 Pacific Ave S, Tacoma, WA, 98444 98444 47.120297 -122.435306 WA WA American
2105 2106 188 Colectivo Prospect 5.0 103.0 Coffee and Tea, American, Breakfast and Brunch $ 2211 North Prospect Avenue, Milwaukee, WI, 53202 53202 43.059145 -87.885167 WI WI Breakfast
26807 26808 65 sweetgreen (West End) 5.0 102.0 Healthy, Salads $ 2238 M St NW, Washington, DC, 20037 20037 38.905052 -77.049380 DC DC NaN
27964 27965 94 Starbucks (Oakton) 5.0 101.0 Cafe, Coffee &amp; Tea, Breakfast and Brunch, ... $ 2930 Chain Bridge Road, Oakton, VA, 22124 22124 38.882349 -77.299989 VA VA Breakfast
30902 30903 21 Ryan's Bagel Cafe 5.0 99.0 Breakfast and Brunch, American, Sandwiches, Ba... NaN 10261 South 1300 East, Sandy, UT, 84094 84094 40.565180 -111.852960 UT UT Breakfast
22723 22724 138 Starbucks (3347 M Street Nw) 5.0 97.0 Bakery, Breakfast and Brunch, Cafe, Coffee &am... $ 3347 M Street Nw, Washington, DC, 20007 20007 38.905250 -77.067710 DC DC Breakfast
36352 36353 5 Arcadia Wine &amp; Spirits 5.0 94.0 Alcohol, Liquor Stores, Wine $$ 5626 E R L Thornton Fwy, Dallas, TX, 75223 75223 32.790724 -96.745450 TX TX NaN
36753 36754 2 Starbucks (I-45 &amp; 336) 5.0 92.0 Bakery, Breakfast and Brunch, Cafe, Coffee &am... $ 1403 North Loop 336, Conroe, TX, 77304 77304 30.332637 -95.479687 TX TX Breakfast
34444 34445 5 Starbucks (Brownfield &amp; Milwaukee) 5.0 91.0 Bakery, Breakfast and Brunch, Cafe, Coffee &am... $ 5014 Milwaukee, Lubbock, TX, 79407 79407 33.546354 -101.957598 TX TX Breakfast
11204 11205 148 Thai Pod Restaurant 5.0 89.0 Thai $ 2015 NE Broadway St, Portland, OR, 97212 97212 45.535164 -122.645562 OR OR Thai

One thing that stood out to us was the fact that so many Breakfast and American category restaurants were the highest ranking. This is probably because these categories included stores like Starbucks which we know many people frequent everyday. 7 of the 20 top restaurants using this calculation are Starbucks stores which also accounts for all the repeating category types. The remaining categories that make it into the top 20 are Vietnamese, Japanese, and Thai. The stores not labeled with our general categorization are Sweetgreen (salads) and a liquor store.

Moving on, we are taking a look at some stats by price range.

In the price_range column, each location is assigned a dollar-sign value that corresponds to how cheap or expensive the cost if food is there. In order to observe how this data is distributed and how it corresponds to other factors like number of ratings and average rating, we created three plots.

In [ ]:
df['price_range_number'] = df['price_range'].apply(lambda x: 1 if x =='$' else 2 if x == '$$' else 3 if x == '$$$' else 4 if x == '$$$$' else np.nan)

The first is a bar graph which shows the number of restaurants that fall under each price_range ($, $$, $$$, $$$$). From this we can observe that most restaurants in the dataset are $ in price, while there are very few that are $$$ or $$$$ in price. The next is a box and whisker plot showing the relationship between each price category and the number of ratings. From this we can see that the median of the number of ratings is around 50 and is very close across the price ranges, but $ and $$ have a lot of outlier points. The third plot is also a box and whiskers plot that compares ratings across the price ranges. Again, the median across all plots is very similar and all around a 4.6 rating. Again, there are a lot of outliers for price points $ and $$.

In [ ]:
fig, axes = plt.subplots(1,3, figsize=(15, 6))
fig.suptitle('Exploring Restaurant Prices')

sns.countplot(ax=axes[0], data=df, x='price_range_number', palette="Set1").set_title('The Price Categories')

sns.boxplot(ax=axes[1], data=df, x='price_range_number', y = 'number_of_ratings', palette="Set1").set_title('Number of Ratings by Price Category')

sns.boxplot(ax=axes[2], data=df, x='price_range_number', y = 'rating', palette="Set1").set_title('Average Rating by Price Category')
Out[ ]:
Text(0.5, 1.0, 'Average Rating by Price Category')

Category Analysis:

First, let's get the average rating by category. To do this, we group by category, then take the mean of the ratings score, and show the count of the number of ratings used. We also display them in descending order so that we can view it as a ranking.

In [ ]:
df_by_cat_rating = df.groupby('gen_category')['rating'].agg(['mean', 'count'])
#print (df_by_cat_rating.sort_values(by=['mean'], ascending=False))
df_by_cat_rating.sort_values(by=['mean'], ascending=False, inplace=True)
print("Average Rating and Counts of Ratings for Each Category")
df_by_cat_rating.reset_index(inplace=True)
df_by_cat_rating.head(16)
Average Rating and Counts of Ratings for Each Category
Out[ ]:
gen_category mean count
0 Vietnamese 4.708392 286
1 Dessert 4.707273 715
2 Convenience 4.705926 405
3 Asian 4.703976 327
4 Vegetarian 4.699338 151
5 Thai 4.690020 501
6 Korean 4.679143 350
7 Japanese 4.674190 988
8 Breakfast 4.653904 1870
9 Mediterranean 4.629263 475
10 Pizza 4.580267 2022
11 Italian 4.552778 252
12 Chinese 4.546014 1154
13 Mexican 4.522583 2400
14 Indian 4.516926 579
15 American 4.483449 8610

Vietnamese restaurants had the highest rating with Dessert and Convenience in close second and thirds. American had the lowest average rating despite having the most occurrences in the dataset. This may be because with more occurrences, there are more restaurants that can pull the rating up or down and maybe in this case down. There is a variety of American restaurant quality that may be affecting this too. Although this could be said about any of the categories, the sheer number of American restaurants is a factor that must be thought about. Generally, the restaurants with more ratings tend to be lower on the list such as America, Mexican, Chinese, Pizza, and Breakfast.

Let’s also do a visual representation of the average rating by category too.

In [ ]:
plt.figure(figsize=(15,8))
sns.barplot(data=df_by_cat_rating, x="gen_category", y="mean")
plt.ylim(4, 5)
plt.xticks(rotation=90)
plt.title('Average Rating by Category')
plt.ylabel('Average Rating')
plt.xlabel('Category of Restaurant')
Out[ ]:
Text(0.5, 0, 'Category of Restaurant')

From this graph we can see that the average ratings of the categories are actually very close at the top and get wider as the ranking goes down. They are all still generally in the 4.4-4.7 range so no one category sticks out like a sore thumb. Next let's take a look at the count of restaurants by category at each price point.

We break them up into counts as seen below. For better readability lets graph the count of each restaurant category at each price point.

In [ ]:
print (df.groupby(['gen_category','price_range']).size().unstack(fill_value=0))
price_range       $    $$  $$$  $$$$
gen_category                        
American       9685  3871   66     6
Asian           199    93    0     2
Breakfast      1789   934    3     1
Chinese         866   479    4     0
Convenience    1643    12    0     0
Dessert        1098   219   13     2
Indian          430   208    8     1
Italian         208   215    7     0
Japanese        624   394   13     1
Korean          337    80    3     1
Mediterranean   380   177    5     0
Mexican        2473   682    6     0
Pizza          2804   658   10     1
Thai            306   240    2     0
Vegetarian      205    46    1     0
Vietnamese      237    55    0     0
In [ ]:
df['price_range_str'] = df['price_range'].apply(lambda x: 'Low' if x =='$' else 'Medium' if x == '$$' else 'High' if x == '$$$' else 'Extremely High' if x == '$$$$' else np.nan)
In [ ]:
cat_price_ranges = df.groupby(['gen_category','price_range_str']).size().unstack(fill_value=0)
cat_price_ranges.reset_index(inplace=True)
cat_price_ranges.head()
Out[ ]:
price_range_str gen_category Extremely High High Low Medium
0 American 6 66 9685 3871
1 Asian 2 0 199 93
2 Breakfast 1 3 1789 934
3 Chinese 0 4 866 479
4 Convenience 0 0 1643 12
In [ ]:
plt.figure(figsize=(15,8))
sns.barplot(data=cat_price_ranges, x="gen_category", y="Low")
plt.xticks(rotation=90)
plt.title('Number of Restaurants at Low Price Point by Category')
plt.ylabel('Number of Restaurants')
plt.xlabel('Category of Restaurant')

plt.figure(figsize=(15,8))
sns.barplot(data=cat_price_ranges, x="gen_category", y="Medium")
plt.xticks(rotation=90)
plt.title('Number of Restaurants at Medium Price Point by Category')
plt.ylabel('Number of Restaurants')
plt.xlabel('Category of Restaurant')

plt.figure(figsize=(15,8))
sns.barplot(data=cat_price_ranges, x="gen_category", y="High")
plt.xticks(rotation=90)
plt.title('Number of Restaurants at High Price Point by Category')
plt.ylabel('Number of Restaurants')
plt.xlabel('Category of Restaurant')

plt.figure(figsize=(15,8))
sns.barplot(data=cat_price_ranges, x="gen_category", y="Extremely High")
plt.xticks(rotation=90)
plt.title('Number of Restaurants at Extremely High Price Point by Category')
plt.ylabel('Number of Restaurants')
plt.xlabel('Category of Restaurant')
Out[ ]:
Text(0.5, 0, 'Category of Restaurant')

When looking at these graphs, we can see that there is generally the same ratio of restaurants of each category at the differing price points. We always see American restaurants with the most counts and most of the other categories are closer together.

The 2nd, 3rd, and 4th, most occurrences by category change at each price point. At the lowest price point Pizza, Mexican, and Breakfast are 2,3, and 4 respectively. At the medium price point, Breakfast restaurants now have the second highest occurrences beating out Mexican and Pizza who are 3 and 4 respectively. At the high price point, we see the top 2-4 switch a bit. The 2nd most frequent category is now Dessert, the 3rd is Japanese, and the 4th is Pizza.

The extremely high price point only has restaurants from half the categories: American, Asian, Breakfast, Dessert, Indian, Japanese, Korean, and Pizza with the remaining 8 categories having none.

Maps

Since we were provided the longitude and latitude of each of the restaurants, we were able to explore the data further in terms of location. To begin, I created a heat map to show where all of our restaurants were concentrated. This will help us explore the areas that are concentrated more closely. For this code, we use the folium import to create the heat map.

In [ ]:
# creating a heat map to consider the locations our data is based from
locs = []
heatmap = folium.Map(location=[39.8283, -98.5795], zoom_start=4)
for i, loc in df.iterrows():
    if loc['lat'] != 0 and loc['lng'] != 0:
      locs.append((loc['lat'], loc['lng']))

heatmap.add_children(plugins.HeatMap(locs, radius=18))
<ipython-input-35-8e0d01dc0b4f>:8: FutureWarning: Method `add_children` is deprecated. Please use `add_child` instead.
  heatmap.add_children(plugins.HeatMap(locs, radius=18))
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

From the heat map, we can see the data points being concentrated in certain areas, and more specifically certain cities. For example, we see data points concentrated in Washington, Utah, Texas, District of Colombia, Alabama, Ohio, Illinois, Vermont, Maryland, Virginia, Idaho, and Oregon. Some of the cities we saw a lot of our data concentrated in was Maryland, DC, Virginia, San Antonio, Seattle, Portland, Salt Lake City, and Burlington. After the analysis of the concentrations, we wanted to visually display where our highest priced restaurants were located. We decided to create a seperate graph with points instead. For this graph, we iterated through the rows of our data frame and plotted the graphs with the highest price range.

In [ ]:
# create a map of where all the highest priced restaraunts are located

# center the map for the US
highest_map = folium.Map(location=[39.8283, -98.5795], zoom_start=4)

for index, row in df.iterrows():
    if row["price_range"] == '$$$$':
        folium.Marker(location=[row['lat'], row['lng']], popup= '$$$$', color='red',
                    icon=folium.Icon(color='red')).add_to(highest_map)
highest_map
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

From this map, we can now see the very few points of the highest range prices displayed across the US. We noticed that most of these are in major cities such as Seattle, San Francisco, and DC. We decided to explore this even further with more mapping. In the next few maps, we see maps with data points concentrated in Seattle, Washington, San Francisco, Texas, and DC. For each map, we highlighted the highest pay range restaurants by leaving them in red and as markers instead of circles. To take care of the restaurant entries that, unfortunately, did not have price range data for, we decided to still plot these entries on the graph, but in the color black

In [ ]:
# create a map of restaurants in Seattle, Washington with correcponding price ranges

# filter dataset to restaurants in Washington
wa = ['WA']
waDF = df.loc[df["State"].isin(wa)]


# center the map at Washington DC
wa_map = folium.Map(location=[47.6511, -122.2401], zoom_start=9)

print(waDF['price_range'].unique())
for index, row in waDF.iterrows():
    if row["price_range"] == '$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='darkblue',
                    icon=folium.Icon(color='darkblue')).add_to(wa_map)
    if row["price_range"] == '$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='darkpurple',
                    icon=folium.Icon(color='darkpurple')).add_to(wa_map)
    if row["price_range"] == '$$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='green',
                    icon=folium.Icon(color='green')).add_to(wa_map)
    if row["price_range"] == '$$$$':
        folium.Marker(location=[row['lat'], row['lng']], popup= '$$$$', color='red',
                    icon=folium.Icon(color='red')).add_to(wa_map)
    if row["price_range"] == 'nan':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='black',
                    icon=folium.Icon(color='black')).add_to(wa_map)
wa_map
['$' nan '$$' '$$$' '$$$$']
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [ ]:
# create a map of restaurants in San Antonio, Texas with correcponding price ranges

# filter dataset to restaurants in Washington
tx = ['TX']
txDF = df.loc[df["State"].isin(tx)]


# center the map at Washington DC
tx_map = folium.Map(location=[29.5000, -98.4946], zoom_start=11)

print(txDF['price_range'].unique())
for index, row in txDF.iterrows():
    if row["price_range"] == '$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='darkblue',
                    icon=folium.Icon(color='darkblue')).add_to(tx_map)
    if row["price_range"] == '$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='darkpurple',
                    icon=folium.Icon(color='darkpurple')).add_to(tx_map)
    if row["price_range"] == '$$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='green',
                    icon=folium.Icon(color='green')).add_to(tx_map)
    if row["price_range"] == '$$$$':
        folium.Marker(location=[row['lat'], row['lng']], popup= '$$$$', color='red',
                    icon=folium.Icon(color='red')).add_to(tx_map)
    if row["price_range"] == 'nan':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='black',
                    icon=folium.Icon(color='black')).add_to(tx_map)
tx_map
['$' nan '$$' '$$$' '$$$$']
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [ ]:
# create a map of restaurants in DC with correcponding price ranges

# filter dataset to restaurants in DC
dc = ['DC']
dcDF = df.loc[df["State"].isin(dc)]


# center the map at Washington DC
pr_map = folium.Map(location=[38.9190, -77.0100], zoom_start=13)

print(dcDF['price_range'].unique())
for index, row in dcDF.iterrows():
    if row["price_range"] == '$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='darkblue',
                    icon=folium.Icon(color='darkblue')).add_to(pr_map)
    if row["price_range"] == '$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='darkpurple',
                    icon=folium.Icon(color='darkpurple')).add_to(pr_map)
    if row["price_range"] == '$$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='green',
                    icon=folium.Icon(color='green')).add_to(pr_map)
    if row["price_range"] == '$$$$':
        folium.Marker(location=[row['lat'], row['lng']], popup= '$$$$', color='red',
                    icon=folium.Icon(color='red')).add_to(pr_map)
    if row["price_range"] == 'nan':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']], color='black',
                    icon=folium.Icon(color='black')).add_to(pr_map)
pr_map
[nan '$' '$$' '$$$' '$$$$']
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Something we can conclude from this data is that most of our data entries are in the lower price range, even the bigger cities. But another trend we can see come out of this is that the uber eats data is concentrated around larger populations because the cities we saw concentration in the heat map are very populated. This could tell us something about people in larger cities being more inclined to choose to use a service such as uber eats. But this also might imply something about the regions in which our data is selective towards, and can make us question whether data was more collected from such areas. Something else we can draw our attention to is the locations of our higher priced restaurants. Although the data in general seems to be concentrated along certain states, the highest price range restaurants are almost always within the borders of the big cities in those states. This lets us begin to start thinking about incomes of the people living in these areas, as well as the number of people that populate these areas.

To get specific values for the price range descriptions we already have, we will calculate the average price of menus at each restaurant using the restaurant-menus file and then add that data point to our main dataframe.

In [ ]:
df2 = pd.read_csv('/content/drive/MyDrive/restaurant-menus.csv')

df2.head()
Out[ ]:
restaurant_id category name description price
0 1 Extra Large Pizza Extra Large Meat Lovers Whole pie. 15.99 USD
1 1 Extra Large Pizza Extra Large Supreme Whole pie. 15.99 USD
2 1 Extra Large Pizza Extra Large Pepperoni Whole pie. 14.99 USD
3 1 Extra Large Pizza Extra Large BBQ Chicken &amp; Bacon Whole Pie 15.99 USD
4 1 Extra Large Pizza Extra Large 5 Cheese Whole pie. 14.99 USD
In [ ]:
def find_number(text):
    num = re.findall(r'[0-9]+.[0-9]+',text)
    return " ".join(num)
df2['avg_menu_price']=df2['price'].apply(lambda x: find_number(x))
In [ ]:
df2['avg_menu_price'] = df2['avg_menu_price'].astype(float)
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3375211 entries, 0 to 3375210
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   restaurant_id   int64  
 1   category        object 
 2   name            object 
 3   description     object 
 4   price           object 
 5   avg_menu_price  float64
dtypes: float64(1), int64(1), object(4)
memory usage: 154.5+ MB
In [ ]:
avg_price = df2.groupby('restaurant_id')['avg_menu_price'].mean().to_frame()
avg_price.reset_index(inplace=True)
avg_price.head()
Out[ ]:
restaurant_id avg_menu_price
0 1 5.663684
1 2 5.505333
2 3 10.762143
3 4 10.531892
4 5 4.532576
In [ ]:
df = pd.merge(df, avg_price, left_on='id', right_on='restaurant_id')
In [ ]:
df.head()
Out[ ]:
id position name rating number_of_ratings category price_range full_address zip_code lat lng State state_by_zip gen_category price_range_number price_range_str restaurant_id avg_menu_price
0 1 19 PJ Fresh (224 Daniel Payne Drive) NaN NaN Burgers, American, Sandwiches $ 224 Daniel Payne Drive, Birmingham, AL, 35207 35207 33.562365 -86.830703 AL AL American 1.0 Low 1 5.663684
1 2 9 J' ti`'z Smoothie-N-Coffee Bar NaN NaN Coffee and Tea, Breakfast and Brunch, Bubble Tea NaN 1521 Pinson Valley Parkway, Birmingham, AL, 35217 35217 33.583640 -86.773330 AL AL Breakfast NaN NaN 2 5.505333
2 3 6 Philly Fresh Cheesesteaks (541-B Graymont Ave) NaN NaN American, Cheesesteak, Sandwiches, Alcohol $ 541-B Graymont Ave, Birmingham, AL, 35204 35204 33.509800 -86.854640 AL AL American 1.0 Low 3 10.762143
3 4 17 Papa Murphy's (1580 Montgomery Highway) NaN NaN Pizza $ 1580 Montgomery Highway, Hoover, AL, 35226 35226 33.404439 -86.806614 AL AL Pizza 1.0 Low 4 10.531892
4 5 162 Nelson Brothers Cafe (17th St N) 4.7 22.0 Breakfast and Brunch, Burgers, Sandwiches NaN 314 17th St N, Birmingham, AL, 35203 35203 33.514730 -86.811700 AL AL American NaN NaN 5 4.532576

Since we saw a somewhat specific pattern of where the higher priced restaurants were located (in the midst of cities), we decided to explore the correlation that average income per zipcode and/or the populations in certain cities have on the population of highest priced restaurants in these cities. We were able to find a cvs from a government website which included the average income by zipcode, which we knew would fit perfectly with our restaurants because those also can be split on zipcode. We imported this data into our notebook. Next, we needed to merge the average income and total population columns into the main dataframe. To do this we need to clean up the zipcode columns so that no special characters were present. We did a regex to keep only the numbers and made it a separate column. To make sure the column types matched up, we converted the zipcode columns to floats since NaN values were not liked by the compiler. We then did a merge of the two datasets by the zipcode columns and the average income and total population columns were added to our main dataset.

In [ ]:
income_df = pd.read_csv("/content/drive/MyDrive/postcode_level_averages.csv")
In [ ]:
avg_income = income_df[["zipcode", "total_pop", "avg_income"]]
avg_income.head()
df['zip_code'] = df['zip_code'].astype(str)


def find_number2(text):
    num = re.findall(r'[0-9]+',text)
    return " ".join(num)
df['zipcode clean']=df['zip_code'].apply(lambda x: find_number2(x).split(' ', 1)[0])
df['zipcode clean']=df['zipcode clean'].apply(lambda x: np.nan if x == '' else x)

df['zipcode clean'] = df['zipcode clean'].astype(float)

avg_income['zipcode'] = avg_income['zipcode'].astype(float)

df4 = pd.merge(df, avg_income, left_on='zipcode clean', right_on='zipcode')
df4 = pd.merge(df, avg_income, left_on='zipcode clean', right_on='zipcode')
df4.head()
<ipython-input-47-f034c56fcfed>:14: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  avg_income['zipcode'] = avg_income['zipcode'].astype(float)
Out[ ]:
id position name rating number_of_ratings category price_range full_address zip_code lat ... state_by_zip gen_category price_range_number price_range_str restaurant_id avg_menu_price zipcode clean zipcode total_pop avg_income
0 1 19 PJ Fresh (224 Daniel Payne Drive) NaN NaN Burgers, American, Sandwiches $ 224 Daniel Payne Drive, Birmingham, AL, 35207 35207 33.562365 ... AL American 1.0 Low 1 5.663684 35207.0 35207.0 3110 26956.913183
1 24 32 Cinnabon baked at Flying J (224 Daniel Payne D... NaN NaN Bakery, Desserts $ 224 Daniel Payne Drive, Birmingham, AL, 35207 35207 33.562260 ... AL Dessert 1.0 Low 24 6.706129 35207.0 35207.0 3110 26956.913183
2 164 59 Cosmic Wings - 5 Points West 3.5 13.0 American, Bar Food, Wings, Fast Food, Chicken,... $ 2246 Bessemer Road, Birmingham, AL, 35207 35207 33.497590 ... AL American 1.0 Low 164 9.341111 35207.0 35207.0 3110 26956.913183
3 171 51 Denny's (224 Daniel Payne Drive N) 3.5 68.0 American, Breakfast and Brunch, Coffee and Tea... $$ 224 Daniel Payne Drive N, Birmingham, AL, 35207 35207 33.562544 ... AL Breakfast 2.0 Medium 171 9.503614 35207.0 35207.0 3110 26956.913183
4 2 9 J' ti`'z Smoothie-N-Coffee Bar NaN NaN Coffee and Tea, Breakfast and Brunch, Bubble Tea NaN 1521 Pinson Valley Parkway, Birmingham, AL, 35217 35217 33.583640 ... AL Breakfast NaN NaN 2 5.505333 35217.0 35217.0 4900 35761.224490

5 rows × 22 columns

Hypothesis Testing

Now that we have the average menu price for the restaurants, the average income per zip code, and the total population per zip code, we have all the information we need to perform a linear regression with this information to explore the idea of it being correlated.

Null Hypothesis: The average income is not correlated to the restaurant prices per zipcode.

We first are exploring the correlation between average income and the price range number. If you recall, each price range for each restaurant was assigned a price range and was ranked from 1-4 (1 being lower priced and 4 being the highest priced). The scatter plot below shows income on the x axis and the price range on the y axis.

In [ ]:
#df4 = df4.drop('zipcode', axis=1)

# use average menu price and average income to get the linear regression model
df4 = df4.loc[df4["avg_income"].notna()]
df4 = df4.loc[df4["price_range_number"].notna()]

x = np.array(df4['avg_income']).reshape(-1,1)
y = np.array(df4['price_range_number'])
regression = linear_model.LinearRegression().fit(x, y)

# extract the slope and y intercept from the linear regression model
m = regression.coef_
b = regression.intercept_

# display information
print("Slope: ", m)
print("Y-intercept: ", b)
print("Linear Regression Model: ", m, "x", b)
Slope:  [-1.55019176e-08]
Y-intercept:  1.280730740137753
Linear Regression Model:  [-1.55019176e-08] x 1.280730740137753
In [ ]:
# create the basic scatter plot
plt.plot(x, y, 'o')
plt.plot(x, m*x+b, color = 'orange')
Out[ ]:
[<matplotlib.lines.Line2D at 0x7f547ca4c940>]
In [ ]:
import statsmodels.api as sm
# do the linear regression
x = df4['avg_income']
y = df4['price_range_number']
 
x = sm.add_constant(x)
 
model = sm.OLS(y, x).fit()
predictions = model.predict(x) 
 
#print out the model summary
print_model = model.summary()
print(print_model)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:     price_range_number   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.1107
Date:                Fri, 16 Dec 2022   Prob (F-statistic):              0.739
Time:                        20:23:56   Log-Likelihood:                -21308.
No. Observations:               32947   AIC:                         4.262e+04
Df Residuals:                   32945   BIC:                         4.264e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.2807      0.005    259.983      0.000       1.271       1.290
avg_income  -1.55e-08   4.66e-08     -0.333      0.739   -1.07e-07    7.58e-08
==============================================================================
Omnibus:                     5131.186   Durbin-Watson:                   1.642
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             7969.906
Skew:                           1.201   Prob(JB):                         0.00
Kurtosis:                       3.200   Cond. No.                     2.05e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.05e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
/usr/local/lib/python3.8/dist-packages/statsmodels/tsa/tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
  x = pd.concat(x[::order], 1)
In [ ]:
#df4 = df4.drop('zipcode', axis=1)

# use average menu price and average income to get the linear regression model
x1 = np.array(df4['avg_income']).reshape(-1,1)
y1 = np.array(df4['avg_menu_price'])
regression = linear_model.LinearRegression().fit(x1, y1)

# extract the slope and y intercept from the linear regression model
m1 = regression.coef_
b1 = regression.intercept_

# display information
print("Slope: ", m1)
print("Y-intercept: ", b1)
print("Linear Regression Model: ", m1, "x", b1)
Slope:  [1.06466071e-05]
Y-intercept:  8.949484155637846
Linear Regression Model:  [1.06466071e-05] x 8.949484155637846
In [ ]:
# create the basic scatter plot
plt.plot(x1, y1, 'o')
plt.plot(x1, m1*x1+b1, color = 'orange')
Out[ ]:
[<matplotlib.lines.Line2D at 0x7f5477e94520>]
In [ ]:
import statsmodels.api as sm
# do the linear regression
x1 = df4['avg_income']
y1 = df4['avg_menu_price']
 
x1 = sm.add_constant(x1)
 
model = sm.OLS(y1, x1).fit()
predictions = model.predict(x1) 
 
#print out the model summary
print_model = model.summary()
print(print_model)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         avg_menu_price   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     268.6
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           4.06e-60
Time:                        20:23:56   Log-Likelihood:            -1.0813e+05
No. Observations:               32947   AIC:                         2.163e+05
Df Residuals:                   32945   BIC:                         2.163e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.9495      0.069    130.282      0.000       8.815       9.084
avg_income  1.065e-05    6.5e-07     16.388      0.000    9.37e-06    1.19e-05
==============================================================================
Omnibus:                    35430.126   Durbin-Watson:                   1.886
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          4542504.557
Skew:                           5.292   Prob(JB):                         0.00
Kurtosis:                      59.542   Cond. No.                     2.05e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.05e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
/usr/local/lib/python3.8/dist-packages/statsmodels/tsa/tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
  x = pd.concat(x[::order], 1)

Analysis

As we can see, there is little to no correlation between the average income and price_range_number. The slope we calculate is extremely small and negative (-1.55019176e-08). But to make a claim of our null hypothesis based on these x and y values, we discovered that the p value of the correlation coefficient is 0.739. Since p is not less than .05, we cannot reject our null hypothesis based on the variables used in this scatter plot to show the relationship.

But since we had averaged out the prices per restaurant individually, we decided to see if that would make a difference in how these characteristics are correlated. For this next scatter plot, we use average income as the x axis and used average menu price as the y axis.

This resulted in a linear regression with a slope of 1.06466071e-05, which showed more correlation than the last linear regression we did. This allows us to move further with our analysis. Using the same x and y values, we were able to calculate the p value as well which ended up being 0.000. Since this p value is less than .05, we are able to reject our null hypothesis of average income and average prices at restaurant being uncorrelated.

Moving to Classifications, we wanted to see how well the average income and populations for every zipcode would be able to predict whether a price is higher or lower. Since we needed a binary target, we decided to tackle this first. We split the 4 ranges we had already calculated (price_range) into whether or not it is a higher price. I created another column to our dataframe called is_highest. For this column, if the price range was the lower two price ranges, we say no. If it is from the higher two price ranges, we say yes.

In [ ]:
# make another coloumn called 'is_higher', yes for $$$$, else no
temp = []

for index, row in df4.iterrows():
  if row['price_range'] == '$$$$' or row['price_range'] == '$$$':
    temp.append('Yes')
#   print("here")

  else:
    temp.append('No')
  
df4["is_higher"] = temp

Using this, we perform a KNN Classification and Random Forrest Classification. We use average income and population as the predicting features, and target will be is_higher (basically predicting is the restaurant is a higher priced restaurant based on the population and average score of its location). I will use this data to train a portion to be able to predict whether or not the restaurant is priced higher or not, and use the other portion to test how accuratley the data was trained. For both classifiers, the data is first split into data for training vs testing. The K Nearest Neighbor and the Random Forest algorithms are first ran on the training set (data is fitted) and then uses the test category of the data to predict whether the restaurant is higher priced or not. I used the 10-fold cross-validation method to measure how accuratley the algorithm predicted the correct class (Yes/No).

In [ ]:
# Classification

# prepare data into dataframe
data = df4[['total_pop', 'avg_income']].copy()
target = df4[['is_higher']].copy()

# K-NN Classification
x_train, x_test, y_train, y_test = train_test_split(data, target)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
knn.score(x_test, y_test)

# 10-fold cross validation
cvs = cross_val_score(knn, data, target, cv = 10)

# average accuracy across all the splits
print("K-NN Classification Average Accuracy: " + str(cvs.mean()))
print("K-NN Classification Standard Error: " + str(stats.sem(cvs)))
/usr/local/lib/python3.8/dist-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/usr/local/lib/python3.8/dist-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/usr/local/lib/python3.8/dist-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/usr/local/lib/python3.8/dist-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/usr/local/lib/python3.8/dist-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/usr/local/lib/python3.8/dist-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/usr/local/lib/python3.8/dist-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/usr/local/lib/python3.8/dist-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/usr/local/lib/python3.8/dist-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/usr/local/lib/python3.8/dist-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/usr/local/lib/python3.8/dist-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
K-NN Classification Average Accuracy: 0.9733834727784826
K-NN Classification Standard Error: 0.011979384877157192
In [ ]:
# Random Forest Classification
x_train, x_test, y_train, y_test = train_test_split(data, target)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(x_train, y_train)
rfc.predict(x_test)
rfc.score(x_test, y_test)

# 10-fold cross validation
cvs2 = cross_val_score(rfc, data, target, cv = 10)

# average accuracy across all the splits
print("Random Forest Classification Average Accuracy: " + str(cvs2.mean()))
print("Random Forest Classification Standard Error: " + str(stats.sem(cvs2)))
<ipython-input-56-872101601dfa>:4: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  rfc.fit(x_train, y_train)
/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py:680: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  estimator.fit(X_train, y_train, **fit_params)
/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py:680: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  estimator.fit(X_train, y_train, **fit_params)
/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py:680: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  estimator.fit(X_train, y_train, **fit_params)
/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py:680: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  estimator.fit(X_train, y_train, **fit_params)
/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py:680: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  estimator.fit(X_train, y_train, **fit_params)
/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py:680: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  estimator.fit(X_train, y_train, **fit_params)
/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py:680: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  estimator.fit(X_train, y_train, **fit_params)
/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py:680: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  estimator.fit(X_train, y_train, **fit_params)
/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py:680: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  estimator.fit(X_train, y_train, **fit_params)
/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py:680: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  estimator.fit(X_train, y_train, **fit_params)
Random Forest Classification Average Accuracy: 0.8999370446841777
Random Forest Classification Standard Error: 0.036725425918588524

Analysis

We can conclude that, with the parameters we used, K-NN Classification Average had a better prediction performance than the Random Forest Classification because K-NN Classification had an average of being accurate 97.33% of the time while Random Forest Classification had an average of being accurate 89.99% of the time. But overall, because both classifications had relatively well predictions, population and average income for zipcodes did a relatiely good job predicting whether a restaurant has a price range on the higher end.

Conclusion

Based on our analysis of the Uber Eats data, there are some insights we can state. Comparing the restaurants that were priced on different scales was difficult, so we wanted to explore some of the reasons for the distribution of where the higher priced restaurants were placed. We thought income in certain areas might be a very significant factor on where the higher priced restaurants are located compared to those that are lower priced. But it turns out that the correlation is minimal. Once we calculated the averages for ourselves of what the average price was for each restaurants instead of just using a range that the data set already provided, we could reject the null hypothesis that these characteristics were uncorrelated. We also were able to see that using population in addition to average income per area worked relatively well in training our data set to predict whether the restaurant was higher or lower priced. This shows us that having both of these predictors would create more correlation than just using average income to determine price. Overall, this data on Uber Eats was able to provide us with distribution, ratings and prices of popular restaurants in many popular cities throughout the United States. We were able to use this information to explore factors that might go into the pricing and locations of these restaurants such as average income per zipcode and populations of the zipcodes in question.